---
title: "Lab 10"
format:
  html:
    echo: true
---
Please note that there is a file on Canvas called [**Getting started with R**](https://canvas.swansea.ac.uk/courses/54287/modules){.external target="_blank"} which may be of some use. It provides details of setting up R and RStudio on your own computer, as well as an overview of entering and importing various data files into R. This should mainly serve as a reminder.
Recall that we can clear the environment using `rm(list=ls())`. It is advisable to do this before attempting a new question to avoid confusion with variable names.
### Example 1 {.unnumbered}
The **Factor Analysis (SPSS Anxiety QA)** dataset contains the results of a questionnaire which aims to determine the causes of SPSS anxiety amongst students. In this example we will use a factor analysis to try to find the underlying causes of this anxiety. We will assume the assumptions of linearity (a matrix of scatterplots may be used for this; see the sketch after the import step below) and no outliers, but we will investigate the assumptions of no extreme multicollinearity and factorability. The data are ordered categorical and $n=2571$, satisfying the data type and sample size assumptions. Follow the steps below:
* We first import, check and attach the data:
```r
library(foreign)                                        # provides read.spss()
Anxiety <- read.spss(file.choose(), to.data.frame = T)  # select the .sav file when prompted
```
```{r, echo=FALSE}
library(foreign)
Anxiety<-read.spss("Factor Analysis (SPSS Anxiety QA).sav", to.data.frame = T)
```
```{r, results='hide'}
head(Anxiety)
summary(Anxiety)
attach(Anxiety)
```
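* Although linearity is assumed here, a quick visual check could be made with a scatterplot matrix of a few of the questions. The code below is only a sketch and uses an arbitrary subset of the question columns:
```r
# scatterplot matrix of a few questions to eyeball linearity (illustrative only)
pairs(Anxiety[, c("Question_01", "Question_03", "Question_05")])
```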
* Next we combine the variables that we wish to reduce into a single matrix:
```{r}
q<-cbind(Question_01, Question_02, Question_03, Question_04, Question_05,
         Question_06, Question_07, Question_08, Question_09, Question_10,
         Question_11, Question_12, Question_13, Question_14, Question_15,
         Question_16, Question_17, Question_18, Question_19, Question_20,
         Question_21, Question_22, Question_23)
```
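* As an aside, the same matrix could be built programmatically rather than typing out all 23 names, assuming the columns are named exactly `Question_01` to `Question_23` as above; `data.matrix()` converts factor columns to their underlying numeric codes, just as `cbind()` does here.
```r
# equivalent construction of q (sketch): select the 23 question columns by name
q <- data.matrix(Anxiety[, sprintf("Question_%02d", 1:23)])
```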
* In order to check the no extreme multicollinearity assumption we calculate the determinant of the correlation matrix:
```{r}
cor<-cor(q)
det(cor)
```
* We see that $\det(\text{correlation matrix})=0.0005>0.00001$ which satisfies the no extreme multicollinearity assumption.
* We next check factorability with the KMO test and Bartlett's test, using the **psych** package.
```r
install.packages("psych")
```
```{r, warning=FALSE}
library(psych)
KMO(cor)
cortest.bartlett(cor)
```
* The KMO statistic is $0.93>0.5$ and Bartlett's test is significant, therefore satisfying the factorability assumption.
* In order to determine the number of factors we calculate and plot the eigenvalues.
```{r}
eigen<-eigen(cor)
val<-eigen$values
val
plot(val, type="b", ylab="Eigenvalue", xlab="Factor", main="Scree plot")
```
* We see from the output that 4 factors should be retained: the Kaiser criterion (eigenvalues $>1$) suggests 4 factors and the scree plot confirms this.
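* If you would like to check the Kaiser criterion programmatically (a convenience only, not part of the original steps), you could count how many eigenvalues exceed 1:
```r
sum(val > 1)   # number of factors suggested by the Kaiser criterion
```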
* We now proceed with the factor analysis with 4 factors.
```r
install.packages("GPArotation")
```
```{r, message=FALSE}
library(GPArotation)
fa<-fa(q,nfactors = 4, rotate="none",fm="pa")
fa
```
* An inspection of the **Pattern Matrix** in the output shows that it is currently difficult to identify the factors (there is no clear distinction between high and low loadings). We therefore perform a rotation, first employing an oblique Oblimin rotation.
```{r}
fa2<-fa(q,nfactors = 4, rotate="oblimin",fm="pa")
fa2
```
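* The factor correlations referred to in the next step are stored in the oblique solution; assuming the usual structure of a **psych** `fa` object, they can be printed with:
```r
fa2$Phi   # inter-factor correlation matrix from the Oblimin solution
```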
* Since not all of the $|\text{factor correlations}|> 0.32$, we use a varimax (orthogonal) rotation instead:
```{r, results=FALSE}
fa3<-fa(q,nfactors = 4, rotate="varimax",fm="pa")
fa3
```
* In order to make identifying the factors easier we hide loadings smaller than 0.3 using the following R code:
```{r}
print(fa3$loadings, digits=2, cutoff=.3, sort=TRUE)
```
* An inspection of the rotated pattern matrix now shows the factors more clearly. In particular, we see that:
* Factor 1 represents questions directly related to statistics (Questions 1,3,4,5,12,16,20,21);
* Factor 2 represents questions directly related to computing (Questions 6,7,10,13,14,15,18);
* Factor 3 represents questions directly related to mathematics (Questions 8,11,17);
* Factor 4 represents questions related to peer pressure (Questions 2,9,19,22,23).
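* If factor scores are needed for follow-up analyses, `fa()` returns them when raw data are supplied (regression scores by default); a quick look, assuming the `fa3` object above:
```r
head(fa3$scores)   # estimated factor scores for each respondent
```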
### Exercise 1 {.unnumbered}
Perform a factor analysis on the **Factor Analysis (Scholarship Tests)** dataset to determine the main underlying factors in the variables. It is important to note that the variables **pairnum**, **sex** and **zygosity** should **not** be included in your analysis as they are not in the correct form, i.e. they are not ordered categorical with 5 or more categories, nor are they continuous.
With respect to assumptions, linearity and no outliers may be assumed, however all other assumptions must be checked.
**Hint:** This dataset contains a number of `NA` entries which cause problems for the analyses. These can be removed using the following code (assuming the name allocated to the dataset when importing it into R is **Scholarship**):
```r
Scholarship2<-na.omit(Scholarship)
attach(Scholarship2)
head(Scholarship2)
```
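To see how many incomplete rows were dropped, you could compare the row counts (this assumes the object names used in the hint above):
```r
nrow(Scholarship) - nrow(Scholarship2)   # number of rows removed due to NA entries
```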
### Example 2 {.unnumbered}
In this example we will use a principal component analysis (PCA) to try to reduce the variables in the **Iris** dataset to a smaller number of components. The dataset contains measurements of iris plants; we ignore the **Species** variable as it is not in the correct format for the procedure.
As in the factor analysis example, we will assume the assumptions of linearity (a scatterplot matrix may be used for this; see the sketch below) and no outliers, but we will investigate the assumptions of no extreme multicollinearity and factorability. The four measurement variables are continuous and $n=150$, which satisfies the data type and sample size assumptions. Follow the steps below:
* We use the built-in dataset **iris** for this example. We load and check it using the following code:
```{r}
library(datasets)
head(iris)
attach(iris)
```
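* Although linearity is assumed, a quick scatterplot matrix of the four measurements can be used to eyeball it (a sketch only):
```r
pairs(iris[, 1:4])   # pairwise scatterplots of the four measurements
```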
* Next we combine the variables that we wish to reduce into a single matrix:
```{r}
x<-cbind(Sepal.Length,Sepal.Width, Petal.Length, Petal.Width)
```
* In order to check the no extreme multicollinearity assumption we calculate the determinant of the correlation matrix:
```{r}
cor3<-cor(x)
det(cor3)
```
* From the above output, we see that $\det(\text{correlation matrix})=0.008>0.00001$ which satisfies the no extreme multicollinearity assumption.
* We next check the factorability using the KMO test and Bartlett's test.
```{r, warning=FALSE}
library(psych)
KMO(cor3)
cortest.bartlett(cor3)
```
* The KMO statistic is $0.54>0.5$ and Bartlett's test is significant, therefore satisfying the factorability assumption.
* We next perform the PCA with the built-in `prcomp` command.
```{r}
pca <- prcomp(iris[c(1:4)], center = T, scale=T)
summary(pca)
```
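* The eigenvalues themselves are the squares of the standard deviations reported by `prcomp`; assuming the `pca` object above, they can be extracted as a quick check:
```r
pca$sdev^2   # eigenvalues (variances of the principal components)
```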
* We see from the output that the Kaiser criterion suggests only 1 component (the eigenvalues are the squares of the values in the "Standard deviation" row). However, closer inspection of the scree plot shows a clear cut-off after two components. Use the code below to produce the scree plot:
```{r}
plot(pca, type="l")
```
* Sometimes a **biplot** can help visualise the components:
```{r}
biplot(pca, scale=0)
```
* We may also use the loadings of the 2 retained components to help interpret them:
```{r}
library(psych)
pca2 <- principal(x,2,rotate="none")
pca2
```
* As it is not clear from the component loadings what the components represent, we perform a rotation (oblique Oblimin first):
```{r}
library(GPArotation)
pca3 <- principal(x,2,rotate="oblimin")
pca3
```
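* The component correlations referred to in the next step can be inspected directly; assuming the oblique solution stores them in `Phi` (as **psych**'s factor routines do), use:
```r
pca3$Phi   # component correlation matrix from the Oblimin solution
```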
* Since not all of the $|\text{component correlations}|> 0.32$, we use a varimax (orthogonal) rotation instead:
```{r}
pca4 <- principal(x,2,rotate="varimax")
pca4
```
* To make identifying the components easier, we hide loadings smaller than 0.4 using the following code:
```{r}
print(pca4$loadings, digits=2, cutoff=.4, sort=TRUE)
```
* Using this output, together with the biplot above, we see that the first component is made up of sepal length, petal length and petal width, while the second component consists of sepal width only.
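* If component scores are required for further analysis, `prcomp()` stores them in the `x` component of its result; a quick look, assuming the `pca` object above:
```r
head(pca$x[, 1:2])   # scores on the first two principal components
```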
### Exercise 2 {.unnumbered}
Perform a principal component analysis on the **Places** dataset to determine the main underlying components in the variables. With respect to assumptions, linearity and no outliers may be assumed, however all other assumptions must be checked.